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Abstract 



We present a parallel algorithm for solving backward stochastic differential equations 
(BSDEs in short) which are very useful theoretic tools to deal with many financial 
pro blems ranging from option pricing option to risk management. Our algorithm based 
on iGobet and Labartl ( 2010T ) exploits the link between BSDEs and non linear partial 



differential equations (PDEs in short) and hence enables to solve high dimensional non 
linear PDEs. In this work, we apply it to the pricing and hedging of American options 
in high dimensional local volatility models, which remains very computationally 
demanding. We have tested our algorithm up to dimension 10 on a cluster of 512 CPUs 
and we obtained linear speedups which proves the scalability of our implementation. 

Keywords : backward stochastic differential equations, parallel computing, Monte- 
Carlo methods, non linear PDE, American options, local volatility model. 

1 Introduction 

Pricing and hedging American options with a large number of underlying assets is a 
challenging financial issue. On a single processor system, it can require several hours of 
computation in high dimensions. Recent advances in parallel computing hardware 
such as multi-core processors, clusters and GPUs are then of high interest for the 
finance community. For a couple of years, some research te ams have been tackling the 



parall elization of numerical algorithms for option pricing. iThulasiram and Bondarenko 



(|2002T l developed a parallel algorithm using MPI for pricing a class of multidimensional 
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financial derivatives using a binomial lattice approach. iHuang and Thularisaml (|2005l ) 
presented algorithms for pricing American style Asian options using a binomial tree method. 
Concerning the parallelization of Monte-Carlo methods for pricing multi-dimensional 



Dune Doan et al. 


( 


2QW). 1 


the 


Ibahez and Zapatero 



We refer to Toke and Girard 



exercise boun dary. A GPGPU approach based on quantization techniques has recently been 
developed by Pages and Wilbertz ( 201ll ). 



Our approach is based on solving Backwar d Stochastic Diff e rential Equations (BSDEs in 
short). As explained in the seminal paper by IeI Karoui etaD (jl997bh . pricing and hedging 
European optio n s in lo cal volatility models boil down to solving standard BSDEs. From 
El Karoui et all (|l997al l. we know that the price of an American option is also linked to a 



particular class of BSDEs called reflect ed BSDEs . Seve ral sequential algorithms to solve BS- 
DEs can be found in the literature. Ma et al. 1 (Il994h adopted a PDE approach, whereas 
Bouchard and Touzil (j2004h and iGobet et al.1 (|2005h used a Monte-Carlo approach based 



on the dynamic programm ing e quation. The Monte-Carlo approach was also investigated 
by Ballv and Pages ( 20031 ) and Delarue and Menozzil ( 20061 ) who applied quantization tech- 
niques to solv e the dynamic prog r ammi ng equation. Our approach is based on the algorithm 
developed by Gobet and Labart ( 2010l ) which combines Picard's iterations and an adaptive 
control variate. It enables to solve standard BSDEs, ie, to get the price and delta of Euro- 
pean options in a local volatility model. Compared to the algorithms based on the dynamic 
programming equation, ours provides regular solutions in time and space (which is coherent 
with the regularity of the option price and delta) . T o apply it to the pricing and hedging of 
American options, we use a technique introduced by lEl Karoui et al.l (|1997al ). which consists 
in approximating a reflected BSDE by a sequence of standard BSDEs with penalisation. 

The paper is organized as follows. In section [21 we briefly recall the link between BSDEs 
and PDEs which is the heart of our algorithm. In section O we describe the algorithm and 
in Section 2] we explain how the parallelization has been carried out. Section describes 
how American options can be priced using BSDEs. Finally, in Section El we conclude the 
paper by some numerical tests of our parallel algorithm for pricing and hedging European 
and American basket options in dimension up to 10. 



1.1 Definitions and Notations 

• Let Ct' be the set of continuously differentiable functions 4> : (t, x) G [0, T] x M. d with 
continuous and uniformly bounded derivatives w.r.t. t (resp. w.r.t. x) up to order k 
(resp. up to order I). 

• Cp denotes the set of C fc_1 functions with piecewise continuous k th order derivative. 

• For a g]0, 1], C k+a is the set of C k functions whose k th order derivative is Holder 
continuous with order a. 
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2 BSDEs 

2.1 General results on standard BSDEs 

Let (fl, J-, P) be a given probability space on which is defined a g-dimensional standard Brow- 
nian motion W, whose natural filtration, augmented with P-null sets, is denoted (J r t)o<t<r (T 
is a fixed terminal time). We denote (Y, Z) the solution of the following backward stochastic 
differential equation (BSDE) with fixed terminal time T 



dY t = f(t, X t , Y t , Z t )dt - Z t dWt, Yt = HX T ) 



(2.1) 



where / : [0, T] x 



X K X 



$ : 

ft 



and X is the Revalued process solution of 



X t = x+ f b(s,X s )ds+ f a(s,X s )dW s 
Jo Jo 



(2.2) 



with b : [0, T]xR d ^ R d and a : [0, T] x R d -> M dx< ?. 

From now on, we assume the following Hypothesis, which ensures the existence and unique- 
ness of the solution to Equations (|2.1|) - (|2.2|) . 

Hypothesis 1 

• The driver f is a bounded Lipschitz continuous function, ie, for all 
(ti, xi, (t 2 ,x 2 ,y 2 ,z 2 ) e[0,T]xR d xRxR, 3L f > 0, 

|/(ii,xi,yi,zi) - f(t 2 ,x 2 ,y 2 ,z 2 )\ < Lf(\ti - t 2 \ + \xi - x 2 \ + \yi - y 2 \ + |zi - z 2 \). 

• o" is uniformly elliptic on [0,T] x R d , ie, there exist two positive constants o~Q,a\ s.t. for 
any £ G M d and any (t, x) G [0, T] x R d 

d 



• $ is bounded in C 2+a , a G]0, 1]. 

• b and a are in C^' 3 and dta is in 

2.2 Link with semilinear PDEs 



Let us also recall the link between BSDEs and semilinear PDEs. Although the relation is 
the keystone of ou r algor ith m, as explained in Sectio n EH we do not develop it and refer to 
Pardoux and Penal (fl~992h or IeI Karoui et al.1 (|l997bh for more details. 



According to iPardoux and Peng! ([1992], Theorem 3.1), we can link the solution (Y,Z) of 
the BSDE (|2TT|) to the solution u of the following PDE: 



{dtu(t, x) + Cu(t, x) + f(t, x, u(t, x), (d x ua)(t, x)) = 0, 
u(T,x) = $(x), 



(2.3) 



where £ is defined by 



C(t,x)V>{t,x) = - ^2[aa*]ij(t,x)dl u(t,x) + ^bi(t,x)d Xi u(t, 



'■j 
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Theorem 1 ( Delarue and Menozzi ( 20061 ). Theorem 2.1). Under Hypothesis^ the solution 



u of PDE belongs to C^ 2 . Moreover, the solution (Y t , Z t )o<t<T of satisfies 

Vte[0,T], (Y t ,Z t ) = (u(t,X t ),d x u(t,X t )o-(t,X t )). (2.4) 

3 Presentation of the Algorithm 

3.1 Description 

We present the algorithm introduced by Gobet and Labart ( 2010l ) to solve standard BSDEs. 



It is based on Picard's iterations combined with an adaptive Monte-Carlo method. We recall 
that we aim at numerically solving BSDE (|2.ip . which is equivalent to solving the semilinear 
PDE (|2.3|) . The current algorithm provides an approximation of the solution of this PDE. 
Let u k denote the approximation of the solution u of (|2.3|) at step k. If we are able to 
compute an explicit solution of (|2.2|) . the approximation of (Y, Z) at step k follows from (|2.4|) : 
{Y k ,Z k ) = (u k (t, X t ),d x u k (t, X t )(r(t, X t )), for all t e [0,T]. Otherwise, we introduce X N the 
approximation of X obtained with a iV— time step Euler scheme: 

Vse [0,T], dX? = b(v N (s),X» N{s) )ds + o-(<p N (s),x£ N{s) )dW s , (3.1) 

where tp N (s) = swp{tj : tj < s} is the largest discretization time not greater than s and 
{0 = to < ti < • • • < tjy = T} is a regular subdivision of the interval [0, T]. Then, we write 

(Y k ,Z k ) = {u k (t,X?\d x u k {t,X?)o-(t,X?)), for all t G [0,T]. 

It remains to explain how to build the approximation (u k )f c of u. The basic idea is the 
following: 

u k+1 = u k + Monte-Carlo evaluations of the error(n — u k ). 

Combining Ito's formula applied to u(s,X s ) and u k (s,X^) between t and T and the 
semilinear PDE (|2.3|) satisfied by u, we get that the correction term c k is given by 

c k (t,x) = (u-u k )(t,x) =E[* (t,x,f u ,$,W)-* N (t,x,-(d t + £ N )u k ,u k (T,.),w) \Q k 

where 

. C N u(8,X?) = ^ id [aa%Ms),X% s) )dl x ^ 
• f v : [0, T] x M. d — > R denotes the following function 

fv(t,x) = f(t,x,v(t,x), (d x va)(t,x)), 



s 



where / is the driver of BSDE (|2.ip . a is the diffusion coefficient of the SDE satisfied 
by X and v : [0,T] x R d — > R is C 1 w.r.t. to its second argument. 



and fy N denote 



*{s,V,gi,g2,W) = J gi (r,Xpy(W))dr + g 2 (X^(W)), 
* N (s, y, 91,92, W) = £ gi (r, X?>'>y(W))dr + g 2 (X^ y (W)), 
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where X s,y (resp. X N,s,y ) denotes the diffusion process solving (|2.2p and starting from 
y at time s (resp. its approximation using an Euler scheme with N time steps), and W 
denotes the standard Brownian motion appearing in (|2.2|) and used to simulate X N , as 
given in (|3,1|) . 

• Q k is the o"-algebra generated by the set of all random variables used to build u k . In the 
above formula of c k , we compute the expectation w.r.t. the law of X and X N and not 
w.r.t. the law of u k , which is Q k measurable. (See Definition [2 for a rigorous definition 
of G k ). 

Note that ^ and ^ N can actually be written as expectations by introducing a random 
variable U uniformly distributed on [0, 1]. 

y, 9i,92, W) = E v \(T - s) gi (s + (T — s)U, X s * s)u (W)) + g 2 (X s ^(W)) 



^ (s, y, 9i,92,W) = E V [(T - s) gi (s + (T — s)U, X?™_ a)u (W)) + g 2 (X^(W)) . 

In the following, let ip N (s, y, g\, g 2 , W, U) denote 

^ N (s,y, gi ,g 2 , W, U) = (T- s) gi (s + (T - s)U, X^_ s)u (W)) + g 2 (X^(W)) (3.2) 

such that V N (s, y, gi ,g 2 ,W) = E n & N (s,y, 9u g 2 ,W, U)\ . 

From a practical point of view, the PDE d23} is solved on [0, T] x V where V C R d such 
that sup 0<t<r \Xt\ 6 V with a probability very close to 1. 

Algorithm 1 We begin with u° = 0. Assume that an approximated solution u k of class C 1 ' 2 
is built at step k — 1. Here are the different steps to compute u k+1 . 

• Pick at random n points (t k ,x k )i<i< n uniformly distributed over [0,T] x T>. 

• Evaluate the Monte-Carlo correction c k at step k at the points (t k ,x k )i<i< n using M 
independent simulations 

M 



c k (tt x k ) = ± jr [P N (t k ,xt U k + (d t + C N )u k , d> - u k , W m -^\ U m < k l 



m=l 



Compute the vector (u (t k ,x k ))i<i< n . Now, we know the vector (u +c )(t k ,x k ))i<i< n . 
From these values, we extrapolate the function u k+1 = u k + c k on [0, T] x T>. 



,fc+l 



(t,x) =V k (u k + c k )(t,x), for (t, x) G [0, T] x V, (3.3) 



where P k is a deterministic operator, which only uses the values of the function at the 
points (t k ,Xi)i<i< n to approximate the function on the whole domain [0, T] x V. The 
choice of the operator P k is discussed in Section [ 



Since c k is computed using Monte-Carlo simulations instead of a true expectation, the 
values (c k (t k , x k ))i<i< n are random variables. Therefore, u k+l is a random function depending 
on the random variables needed to compute u k and W m ' k ' % , U m > k ' % , 1 < m < M, 1 < i < n. 
In view of this comment, the cr-algebra Q k has to be redefined to take into account this new 
source of randomness. 
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Definition 2 (Definition of the cr-algebra Q k ). Let Q k+1 define the a-algebra generated by the 
set of all random variables used to build u k+1 . Using yields 

g k+1 = g k y a(A k ,s k ), 

where A k is the set of random points used at step k to build the estimator V k (see below), 
S k = {W™ 1 ^' 1 , l/™ 1 ^' 1 , 1 < m < M, 1 < i < n}, is the set of independent Brownian motions 
used to simulate the paths X m,k,N (x k ) , and Q k is the a-algebra generated by the set of all 
random variables used to build u k . 



3.2 Choice of the operator 

The most delicate part of the algorithm is how to extrapolate a function h and its derivatives 
when only knowing its values at n points (tii%i)i=l,...,n £ [0>^1 x 2?- 



3.2.1 A kernel operator 

In the first version of Algorithm [T] presented in Gobet and Labart ( 2O10l ). a function h was 



extrapolated from the values computed on the grid by using a kernel operator of the form 

n 

h(t, x) = u(ti,Xi)K t (t - ti)K x (x - Xi), 



where Kt is a one dimensional kernel whereas K x is a product of d one dimensional kernels. 
Hence, evaluating the function h at a given point (t,x) requires Q(n x d) computations. 



The convergence result established by I Gob et and Labart (201C , The orem 5.1) is based 



on the properties of the operator presented in Gobet and LabartT 



20ld . Section 4). Using 



the linearity and the boundedness of the operator, they managed to prove that the errors 
\\v — V k v\\ and \\d x v — d x (V k v)\\ are bounded, which is a key step in proving the convergence of 
the algorithm. At the end of their paper, they present an operator based on kernel estimators 
satisfying the assumptions required to prove the convergence of the algorithm. 



3.2.2 An extrapolating operator 



The numerical properties of kernel operators are very sensitive to the choice of their window 
parameters which is quite hard to tune for each new problem. Hence, we have tried to use an 
other solution. Basically, we have borrowed the solution proposed by iLongstaff and Schwartz 
(|200ll ) which consists in extrapolating a function by solving a least square problem defined by 
the projection of the original function on a countable set of functions. Assume we know the 
values {yi)i=x ... n °f a function h at the points sc$)i=i . n , the function h can be extrapolated 
by computing 



a. 



arg min > 

1=1 



p 

E 

1=1 



aiBi(ti,Xi 



(3.4) 



where (-Bz)z=i,...,p are some real valued functions defined on [0, T] x V. Once a is computed, we 

set h(t, x) = J2f=i a iBi(t, x). For the implementation, we have chosen the (-Bz)z — x clS Si free 

family of multivariate polynomials. For such a choice, h is known to converge uniformly to h 
when p goes to infinity if P is a compact set and h is continuous on [0, T] x T>. Our algorithm 
also requires to compute the first and second derivatives of h which are approximated by the 
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first and second derivatives of h. Although the idea of approximating the derivatives of a 
function by the derivatives of its approximati on is not theoretica l ly we ll justified, it is proved 
to be very efficient in practice. We refer to Wang and Caffish ( 2O10l ) for an application of 



this principle to the computations of the Greeks for American options. 

Practical determination of the vector a In this part, we use the notation d' = d + I. 
It is quite easy to see from Equation (|3.4|) that a is the solution of a linear system. The value 
a is a critical point of the criteria to be minimized in Equation (|3.4|) and the vector en solves 

p n n 

^2ai^2Bi(ti,Xi)Bj(t i} Xi) = ^2yiBj(ti,Xi) for j = l,...,p 

1 = 1 2=1 1=1 

n 

Aa = Y t y i B{U,Xi) (3.5) 

i=l 

where the p x p matrix A = (Ya=i Bi(ti, Xi)Bj(ti, #i))zj=i,...,p and the vector 
B = (B\, . . . , Bp)* . The matrix A is symmetric and positive definite but often 
ill-conditioned, so we cannot rely on the Cholesky factorization to solve the linear system 
but instead we have to use some more elaborate techniques such as a QR factorization with 
pivoting or a singular value decomposition approach which can better handle an almost 
rank deficient matrix. In our i mplem entation of Algorithm [TJ we rely on the routine dgelsy 
from Lapack lAnderson et al. (1999), which solves a linear system in the least square sense 



by using some QR decomposition with pivoting combined with some orthogonalization 
techniques. Fortunately, the ill-conditioning of the matrix A is not fate; we can improve the 
situation by centering and normalizing the polynomials {B{)i such that the domain 
[0,T] x D is actually mapped to [— 1, l] d . This reduction improves the numerical behaviour 
of the chaos decomposition by a great deal. 

The construction of the matrix A has a complexity of 0(p 2 nd'). The computation of a 
(Equation 14. 3|) requires to solve a linear system of size px p which requires 0(p 3 ) operations. 
The overall complexity for computing a is then 0{p i + np 2 d'). 

Choice of the (-Bj)i. The function u k we want to extrapolate at each step of the algorithm 
is proved to be quite regular (at least C 1,2 ), so using multivariate polynomials for the B[ 
should provide a satisfactory approximation. Actually, we used polynomials with d' variates, 
which are built using tensor products of univariate polynomials and if one wants the vector 
space Vect{-B/, I = 1, . . . ,p} to be the space of d'— variate polynomials with global degree 
less or equal than 77, then p has to be equal to the binomial coefficient ( d . For instance, 
for 7} = 3 and d' = 6 we find p = 84. This little example shows that p cannot be fixed by 
specifying the maximum global degree of the polynomials Bi without leading to an explosion 
of the computational cost, we therefore had to find an other approach. To cope with the curse 
of dimensionality, we studied differe nt strateg i es for truncating polynomial chaos expansions. 
We refer the reader to Chapter 2 of B » for a detailed review on the topic. From 



a computational point of view, we could not afford the use of adaptive sparse polynomial 
families because the construction of the family is inevitably sequential and it would have 
been detrimental for the speed-up of our parallel algorithm. Therefore, we decided to use 
spa rse polynomial chaos appro ximation based on an hyperbolic set of indices as introduced 
by iBlatman and Sudretl ifcood ). 
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A canonical polynomial with d! variates can be denned by a multi-index v E N d> — Vi 
being the degree of the polynomial with respect the variate i. Truncating a polynomial chaos 
expansion by keeping only the polynomials with total degree not greater than 77 corresponds 
to the set of multi-indices: {y G N rf ' : J2i=i v i — v}- The idea of hyperbolic sets of indices is 
to consider the pseudo q— norm of the multi-index v with q < 1 

1/9 

<rj}. (3.6) 







< v G N d ' 




- 





Note that choosing q = 1 gives the full family of polynomials with total degree not greater 
than r]. The effect of introducing this pseudo-norm is to favor low-order interactions. 

4 Parallel approach 

In this part, we present a parallel version of Algorithm [TJ which is far from being embar- 
rassingly parallel as a crude Monte-Carlo algorithm. We explain the difficulties encountered 
when parallelizing the algorithm and how we solved them. 

4.1 Detailed presentation of the algorithm 

Here are the notations we use in the algorithm. 

• u k = {u k {t k ,x k )) 1 < l < n er 
. c k = (c k (t k ,x k ))i<i< n et" 

• n: number of points of the grid 

• Ku: number of iterations of the algorithm 

• M: number of Monte-Carlo samples 

• iV: number of time steps used for the discretization of X 

• p: number of functions B[ used in the extrapolating operator. This is not a parameter 
of the algorithm on its own as it is determined by fixing 77 and q (the maximum total 
degree and the parameter of the hyperbolic multi-index set) but the parameter p is of 
great interest when studying the complexity of the algorithm. 

• (-£>/) i<z<p is a family of multivariate polynomials used for extrapolating functions from 
a finite number of values. 

• ot k £MP is the vector of the weights of the chaos decomposition of u k . 

• d! = d + 1 is the number of variates of the polynomials B[ . 



S 



Algorithm 1 Iterative algorithm 



1 

2 
3 
4 
5 
6 

7: 
8: 



u u = 0, a" = 0. 

for k = : K it - 1 do 

Pick at random n points (t k ,x k )i<i< n . 
for j = 1 : n do 

for m = 1 : M do 

Let W be a Brownian motion with values in M. d discretized 
on a time grid with TV time steps. 



Let U ~U\ 
Compute 



[o,i]- 



a i ' k 



i> N f uk + (d t + C N )u k ,$ - u\ W, U) . 



/* We recall that u k (t,x) = Y%=i a iBi(t,x) */ 
end for 



1 



M 



— V a*' fc 
M ^ m 

m=l 



2=1 



(4.1) 
(4.2) 



10: 
11: 



end for 

Compute 



= arg min 



1=1 



K fc + ^)-E«^(^^") 



(4.3) 



12: end for 



4.2 Complexity of the algorithm 

In this section, we study in details the different parts of Algorithm [T] to determine their 
complexities. Before diving into the algorithm, we would like to briefly look at the evaluations 
of the function u k and its derivatives. We recall that 

p 

u k (t,x) =^afB[(t,x) 
i=i 

where the Bi(t,x) are of the form t^ l >° nf=i x i^ lJ and the Pn are some integers. Then the 
computational time for the evaluation of u k (t,x) is proportional to p x d! . The first and 
second derivatives of u k write 

v 

V x u k (t,x) = ^a k V x B l (t,x), 
i=i 

V 2 x u k (t,x) = j2<*iVlB l (t,x), 
i=i 

and the evaluation of V x Bi(t,x) (resp. V x Bi(t,x)) has a computational cost proportional to 
d? (resp. d 3 ). 
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• The computation (at line 6) of the discretization of the d— dimensional Brownian motion 
with N time steps requires O(Nd) computations. 

• The computation of each a^* (line [8]) requires the evaluation of the function u k and its 
first and second derivatives which has a cost 0(pd 3 ). Then, the computation of cf for 
given i and k has a complexity of 0(Mpd 3 ). 

• The computation of a (Equation I4.3|) requires 0(p 3 + np 2 d) operations as explained in 
Section [3X2 

The overall complexity of Algorithm [T] is 0(Ki t nM(pd 3 + dN) + Ki t (p 2 nd + p 3 )). 

To parallelize an algorithm, the first idea coming to mind is to find loops with independent 
iterations which could be spread out on different processors with a minimum of communi- 
cations. In that respect, an embarrassingly parallel example is the well-known Monte-Carlo 
algorithm. Unfortunately, Algorithm [T] is far from being so simple. The iterations of the outer 
loop (line E]) are linked from one step to the following, consequently there is no hope paral- 
lelizing this loop. On the contrary, the iterations over % (loop line ED are independent as are 
the ones over m (loop line [SJ , so we have at hand two candidates to implement parallelizing. 
We could even think of a 2 stage parallelism : first parallelizing the loop over i over a small 
set of processors and inside this first level parallelizing the loop over m. Actually, M is not 
large enough for the parallelization of the loop over m to be efficient (see Section l3.2p . It 
turns out to be far more efficient to parallelize the loop over i as each iteration of the loop 
requires a significant amount of work. 

4.3 Description of the parallel part 

As we have just explained, we have decided to parallelize the loop over i (lineHJin Algorithm [T]). 
We have used a Robbin Hood approach. In the following, we assume to have P + l processors 
at hand with n > P. We use the following master slave scheme: 

1. Send to each of the P slave processors the solution a k computed at the previous step 
of the algorithm. 

2. Spread the first P points (t k , x k )i<i<p to the P slave processors and assign each of them 
the computation of the corresponding c k . 

3. As soon as a processor finishes its computation, it sends its result back to the master 
which in turn sends it a new point (t k ,x k ) at which evaluating ^ N to compute c k and 
this process goes on until all the (cj )i=i,... )r , have been computed. 

At the end of this process, the master knows c k , which corresponds to the approximation of 
the correction at the random points (t k ,x k )\<i< n . From these values and the vector u k , the 
master computes a k+1 and the algorithm can go through a new iteration over k. 
What is the best way to send the data to each slave process? 

• Before starting any computations, assign to each process a block of iterations over i 
and send the corresponding data all at once. This way, just one connection has to 
be initialised which is faster. But the time spent by the master to take care of its 
slave is longer which implies that at the beginning the slave process will remain longer 



10 



unemployed. When applying such a strategy, we implicitly assume that all the iterations 
have the same computational cost. 



• Send data iteration by iteration. The latency at the beginning is smaller than in the 
block strategy and performs better when all iterations do not have the same computa- 
tional cost. 

Considering the wide range of data to be sent and the intensive use of elaborate structures, 
the most natural way to pass these objects was to rely on the packing mechanism of MPI. 
Moreover, the library we are using in the code (see Section 14. 5|) already has a MPI binding 
which makes the manipulation of the different objects all the more easy. The use of packing 
enabled to reduce the number of communications between the master process and a slave 
process to just one communication at the beginning of each iteration of the loop over % (lined] 
of Algorithm [T]) . 

4.4 Random numbers in a parallel environment 

One of the basic problem when solving a probabilistic problem in parallel computing is the 
generation of random numbers. Random number generators are usually devised for sequential 
use only and special care should be taken in parallel environments to ensure that the sequences 
of random numbers generated on each processor are independent. We would like to have 
minimal communications between the different random number generators, ideally after the 
initialisation process, each generator should live independently of the others. 
Several strategies exist for that. 

1. Newbies in parallel computing might be tempted to take any random number generator 
and fix a different seed on each processor. Although this naive strategy often leads 
to satisfactory results on toy examples, it can induce an important bias and lead to 
detrimental results. 

2. The first reasonable approach is to split a sequence of random numbers across sev- 
eral processors. To efficiently implement this strategy, the generator must have some 
splitting facilities su ch that there is no need to draw all the samples p rior to any compu- 
tations. We refer to iL'Ecuver and Cotel (|l99lh : lL'Ecuver et al.l(|2002h for a presentation 



of a generator with splitting facilities. To efficiently split the sequence, one should know 
in advance the number of samples needed by each processor or at least an upper bound 
of it. To encounter this problem, the splitting could be made in substreams by jumping 
ahead of P steps at each call to the random procedure if P is the number of processors 
involved. This way, each processor uses a sub-sequence of the initial r andom number 



sequen ce rather than a contiguous part of it. However, as noted by lEntacher et al. 



.ong range correlations in the original sequence can become short range corre 



lations between different processors when using substreams. 

Actually, the best way to implement splitting is to use a generator with a huge period 
such as the Mersenne Twister (its period is 2 19 — 1) and divide the period by a million 
or so if we think we will not need more than a million independent substreams. Doing 
so, we come up with substreams which still have an impressive length, in the case of 
the Mersenne Twister each substream is still about 2 19917 long. 
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3. A totally different approach is to find generators which can be easily parametrised 
and to compute sets of parameters ensuring the statistical independence of the related 
generators. Several ge nerators o f fer su ch a facility such as the ones included in the 
SPRNG package (see Mascagni ( 19971 ) for a detailed presentation of the generators 
implemente d in this package) or the dynamic ally created Mersenne Twister (DCMT in 
short), see iMatsumoto and Nishimural (|2000l ). 



For our experiments, we have decided to use the DCMT. This generator has a sufficiently 
long period (2 521 for the version we used) and we can create at most 2 16 = 65536 independent 
generators with this period which is definitely enough for our needs. Moreover, the dynamic 
creation of the generators follows a deterministic process (if we use the same seeds) which 
makes it reproducible. The drawback of the DCMT is that its initialization process might be 
quite lengthy, actually the CPU time needed to create a new generator is not bounded. We 
give in Figure [T] the distribution. 




Figure 1: Distribution of the CPU time needed for the creation of one Mersenne Twister 
generator 



4.5 The library used for the implementation 

Our code has been implemented in C using the PNL library (see iLelond (12007-2011^ . This 
is a scientific library available under the Lesser General Public Licence and it offers various 
facilities for implementing mathematics and more recently some MPI bindings have been 
added to easily manipulate the different objects available in PNL such as vectors and matrices. 
In our problem, we needed to manipulate matrices and vectors and pass them from the master 
process to the slave processes and decided to use the packing facilities offered by PNL through 
its MPI binding. The technical part was not only message passing but also random number 
generation as we already mentioned above and PNL offers many functions to generate random 
vectors or matrices using several random number generators among which the DCMT. 

Besides message passing, the algorithm also requires many other facilities such as multi- 
variate polynomial chaos decomposition which is part of the library. For the moment, three 
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families of polynomials (Canonical, Hermite and Tchebichev polynomials) are implemented 
along with very efficient mechanism to compute their first and second derivatives. The im- 
plementation tries to make the most of code factorization to avoid recomputing common 
quantities several times. The polynomial chaos decomposition toolbox is quite flexible and 
offers a reduction facility such as described in Section 13.2.21 which is completely transparent 
from the user's side. To face the curse of dimensionality, we used sparse polynomial families 
based on an hyperbolic set of indices. 

5 Pricing and Hedging American options in local volatility 
models 

In this Section, we present a method to price and hedge American options by using Algo- 
rithm [U which solves standard BSDEs. 



5.1 Framework 

Let (0, J 7 , P) be a probability space and (Wt)t>o a standard Brownian motion with values in 
M. d . We denote by (J 7 t)t>o the P-completion of the natural filtration of (Wt)t>o- 
Consider a financial market with d risky assets, with prices S} , • • ■ ,Sf at time t, and let Xf 
be the d-dimensional vector of log-returns X\ = log 5|, for i = 1, • • ■ , d. We assume that (X t ) 
satisfies the following stochastic differential equation: 

( 1 d \ d 

dX\ = f r(t) - Si(t) - - J2 °ij 2 (t, e Xt )\ dt + J2 M*> e Xt )dWl , i = 1, ■ • • , d (5.1) 

on a finite interval [0,T], where T is the maturity of the option. We denote by a 
continuous version of the flow of the stochastic differential Equation (|5.1|) . X l t ,x = x almost 
surely. 

In the following, we assume 
Hypothesis 2 

1. r : [0, T] i — >R is a function. 5 : [0, T] i — >R d is a function. 

2. a : [0, T]xR d i — ► R dxd is a C]' 2 function. 

3. a satisfies the following coercivity property: 

d 

3e > 0V(i,x) G [0,71 x M d ,V£ 6R d ^ [<ra*]ij(t,x)^ >eJ2$ 

l<i,j<d i=l 



We are interested in computing the price of an American option with payoff <&(Xt), where 
: R d i — > M + is a continuous function (in case of a put option, $(x) = (K 



\{e x ^ + ••• + 



From l.Taillet et all <jl99(t ). we know that if 
Hypothesis 3 $ is continuous and satisfies \<&(x)\ < Me M ^ for some M > 0, 
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the price at time t of the American option with payoff is given by 
V(t,X t ) = esssup Te7 - E (e-ft r{s)ds ${X T )\T t 



where % T is the set of all stopping times with values in [t, T] and V : [0, T] x M. d 
defined by V(t,x) = sup TeT E [ e~ ft r ^ da ^{Xp x )). 



We also introduce the amount A(t, Xt) involved in the asset at time t to hedge the American 
option. The function A is given by A(t,x) = V x V(t,x). 

In order to link American option prices to BSDEs, we need three steps: 

1. writing the price of an American option as a solution of a variational inequality (see 
Section 15. 2|) 

2. linking the solution of a variational inequality to the solution of a reflected BSDE (see 
Section 15.31) 



3. approximating the solution of a RBSDE by a sequence of standard BSDEs (see Section 

EM 



We refer to Jaillet et al. ( 1990l ) for the first step and to El Karoui et al. ( 1997al ) for the 



second and third steps. 

5.2 American option and Variational Inequality 

First, we recall the variational inequality satisfied by V. We refer to Jaillet et al. ( 1990l . 



Theorem 3.1) for more details. Under Hypotheses [2] and [31 V solves the following parabolic 
PDE 

!max($(x) — u(t, x),dtu(t, x) + Au(t, x) — r(t)u(t, x)) = 0, , , 

<Z» = *(x). [ ' 

where 

Au(t, x) = EtMt) ~ St(t) - \ ZU eX )) d *At, x) + \ Ei^M^hit, e x )d 2 XlX] u(t, x) 
is the generator of X. 

5.3 Variational Inequality and Reflected BSDEs 



This section is based on lEl Karoui et al.1 ()1997al . Theorem 8.5). We assume 



Hypothesis 4 $ is continuous and has at most a polynomial growth (ie, 3C,p > s.t. 
\§{x)\ < C(l + |x| p );. 

Let us consider the following reflected BSDE 

'-dY t = -r(t)Y t dt - Z t dW t + dH t , 

Y T = $(X T ),Y t >*(X t ),Vt, (5.3) 
[fi(Y t -<S>(X t ))dH t = 0. 
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Then, under Hypotheses [5] and HI u(t, x) = Y t ' x is a viscosity solution of the obstacle problem 
(15. 2p . where Y^' x is the value at time t of the solution of (|5.3p on [t, T] where the superscripts 
t,x mean that X starts from x at time t. We also have (V x u(t, x))*cr(t, x) = Z\ ,x (* means 
transpose). Then, we get that the price V and the delta A of the option are given by 

V(t,X t ) = Y t , A(t,X t ) = (Zta^Xt)- 1 )*. 



5.4 Approximation of a RBSDE by a sequence of standard BSDEs 

We present a way of approximating a R BSDE by a sequence of standard BSDEs. The idea 



was introduced by lEl Karoui et al.l ()1997al . Section 6) for proving the existence of a solution 
to RBSDE by turning the constraint Yt > <&(Xt) into a penalisation. Let us consider the 
following sequence of BSDEs indexed by i 



Y; = $(X T ) - [ T r(s)Y:ds + i /Vs " HX s ))-ds - [ T ZidW s , (5.4) 
Jt Jt Jt 

whose solutions are denoted {Y\ Z l ). We define H\ =ij t 2 (Y* - <f>(X s ))~ds. Under Hypothe- 
ses Eland H the sequence {Y\ Z\H% converges to the solution (Y, Z, H) of the RBSDE ([SSD , 
when i goes to infinity. Moreover, Y % converges increasingly to Y . The term H l is often called 
a penalisation. From a practical point, there is no use solving such a sequence of BSDEs, 
because we can directly apply our algorithm to solve Equation (|5.4|) for a given i. Therefore, 
we actually consider the following penalized BSDE 

Y t = $(X T ) - [ T r(s)Y s ds + u [ T (Y S - <f>(X s ))-ds - F Z s dW s , (5.5) 
Jt Jt Jt 

where uj > is penalization weight. In practice, the magnitude of cv must remain reasonably 
small as it appears as a contraction constant when studying the speed of convergence of our 
algorithm. Hence, the larger ui is, the slower our algorithm converges. So, a trade-off has 
to be found between the convergence of our algorithm to the solution Y of (|5.5|) and the 
accuracy of the approximation of the American option price by Y. 



5.5 European options 

Let us consider a European option with payoff $(Xt), where X follows (|5.1|) . We denote 
by V the option price and by A the hedging strategy associated to the option. From 
El Karoui et al. ( 1997bl ). we know that the couple (V, A) satisfies 



-dV t = -r(t)V t dt - A*a(t,X t )dWt, V T = *(X T ). 

Then, (V, A) is solution of a standard BSDE. This corresponds to the particular case cv 
of (|5.5j) . Y corresponding to V and Z to A£er(t, Xf). 



6 Numerical results and performance 
6.1 The cluster 

All our performance tests have been carried out on a 256— PC cluster from SUPELEC Metz. 
Each node is a dual core processor : INTEL Xeon-3075 2.66 GHz with a front side bus at 
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1333Mhz. The two cores of each node share 4GB of RAM and all the nodes are interconnected 
using a Gigabit Ethernet network. In none of the experiments, did we make the most of the 
dual core architecture since our code is one threaded. Hence, in our implementation a dual 
core processor is actually seen as two single core processors. 

The accuracy tests have been achieved using the facilities offered by the University of 
Savoie computing center MUST. 



6.2 Black-Scholes' framework 



We consider a d— dimensional Black-Scholes model in which the dynamics under the risk 
neutral measure of each asset S l is supposed to be given by 



dS\ = Si((r - 6i)dt + a'dWl) S = . . . , S d ) 



(6.1) 



where W = (W 1 , . . . , W ). Each component W l is a standard Brownian motion. For 
the numerical experiments, the covariance structure of W will be assumed to be given by 
{W i ,W^)t = ptl{i^j} + tlu = j\. We suppose that p S (— which ensures that the 

matrix C = (pl^j} + l{i=j})i<i,j<d is positive definite. Let L denote the lower triangular 
matrix involved in the Cholesky decomposition C = LL* . To simulate W on the time-grid 
< t\ < t2 < ■ ■ ■ < i/Vj we need d x N independent standard normal variables and set 















V w tN J 














\ 



tN-2L 



G, 



\\ft[L — t\L . . . y/tN-i — tN-%L y/tfi — tN-iL) 

where G is a normal random vector in M. dxN . The vector a = (a 1 , . . . ,a d ) is the vector of 
volatilities, 5 = (S 1 , . . . ,5 d ) is the vector of instantaneous dividend rates and r > is the 
instantaneous interest rate. We will denote the maturity time by T. Since we know how to 
simulate the law of (St, St) exactly for t < T, there is no use to discretize equation 
using the Euler scheme. In this Section N = 2. 



6.2.1 European options 



We want to study the numerical accuracy of our algorithm and to do that we first consider 
the case of European basket options fo r which we can compute be nchmark price by using 
very efficient Monte-Carlo methods, see Jourdain and Lelong (2009) for instance, while it is 
no more the case for American options. 
In this paragraph, the parameter oj appearing in (|5.5p is 0. 



European put basket option. Consider the following put basket option with maturity T 

(6.2) 



H§4 



Figure [5] presents the influence of the parameters M, n and q. The results obtained for 
n = 1000, M = 50, 000 and q = 1 (curve (+)) are very close to the true price and moreover 
we can see that the algorithm stabilizes after very few iterations (less than 10). 
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■ n=1000 M=50000 q = l 

n=1000 M=5000 q = l 

n=1000 M=50000 q=0.4 

n=2000 M=5000 q=l 
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_1_ 



7 9 11 13 

Number of iterations (KJt) 



19 



Figure 2: Convergence of the algorithm for a European put basket option with d = 5, p = 0.1, T = 3, 
Sq = 100, a = 0.2, 6 = 0, r = 0.05 K = 100, r\ = 3, oj = 0. The benchmark price computed with a 
high precision Monte-Carlo method yields 2.0353 with a confidence interval of (2.0323, 2.0383). 

• Influence of M: curves (+) and (x) show that taking M = 5000 is not enough to get 
the stabilization of the algorithm. 

• Joined influence of n and M: curves (x) and (♦) show the importance of well balancing 
the number of discretization points n with the number of Monte-Carlo simulations 
M. The sharp behaviour of the curve (♦) may look surprising at first, since we are 
tempted to think that a larger numbers of points n will increase the accuracy. However, 
increasing the number of points but keeping the number of Monte-Carlo simulations 
constant creates an over fitting phenomenon because the Monte-Carlo errors arising at 
each point are too large and independent and it leads the approximation astray. 

• Influence of q: we can see on curves (+) and (*) that decreasing the hyperbolic index 
q can lead a highly biased although smooth convergence. This highlights the impact of 
the choice of q on the solution computed by the algorithm. 

To conclude, we notice that the larger the number of Monte-Carlo simulations is, the 
smoother the convergence is, but when the polynomial family considered is too sparse it can 
lead to a biased convergence. 

European call basket option. Consider the following put basket option with maturity T 
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Figure 3: Convergence of the price of a European call basket option with d = 10, p = 0.2, T = 1, 
S = 100, a = 0.2, 6 = 0, r = 0.05, K = 100, n = 1000 and uj = 0. The benchmark price computed 
with a high precision Monte-Carlo method yields 7.0207 with a confidence interval of (7.0125, 7.0288). 

Figure illustrates the impact of the sparsity of the polynomial basis considered on the 
convergence of the algorithm. The smoothest convergence is achieved by the curve (+), ie 
when M = 30, 000, r\ = 5 and q = 0.6. The algorithm stabilizes very close to the true price 
and after very few iterations. 

• Influence of rj : for a fixed value of q, the sparsity increases when r\ decreases, so the 
basis with rj = A,q = 0.6 is more sparse than the one with r/ = 5,q = 0.6. We compare 
curves (+) (ry = 5) and (x) (rj = 4) for fixed values of q (= 0.6) and M (= 30, 000). We 
can see that for rj = 4 (curve (x)) the algorithm stabilizes after 7 iterations, whereas 
for rj = 6 (curve (+)) less iterations are needed to converge. 

• Influence of M : for fixed values of rj (= 5) and q (= 0.6), we compare curves (+) 
(M = 30000) and (*) (M = 5000). Using a large number of simulations is not enough 
to get a good convergence, as it is shown by curve (*). 

Actually, when the polynomial basis becomes too sparse, the approximation of the solution 
computed at each step of the algorithm incorporates a significant amount a noise which has 
a similar effect to reducing the number of Monte-Carlo simulations. This is precisely what 
we observe on Figure O the curves (x) and (*) have a very similar behaviour although one 
of them uses a much larger number of simulations. 

6.2.2 American options 

In this paragraph, the penalisation parameter u) appearing in (|5.5p is 1. 



18 



Pricing American put basket options. We have tested our algorithm on the pricing of 
a multidimensional American put option with payoff given by Equation (|6.2p 



h M=5000 n=1000 eta=3 q=l 

X M=5000 n=1000 eta=5 q=0.8 

* M=5000 n=1000 eta=5 q=0.6 

♦ M = 30000 n=2000 eta = 3 q = l 

A M=30000 n=2000 eta = 5 q=0.8 

M = 30000 n=2000 eta = 5 q=0.6 




Number of iterations (K_it) 



Figure 4: Convergence of the price of an American put basket option with d — 5, p — 0.2, T = 1, 
S = 100, a = 0.25, S = 0.1, K = 100, r = 0.05 and ui = 1. 

Figure S] presents the influence of the parameters M and q. 

• Influence of M : when zooming on Figure HI one can indeed see that the curves using 
30000 Monte-Carlo simulations are a little smoother than the others but these extra 
simulations do not improve the convergence as much as in Figures [2] and [3] (compare 
curves (+) and (♦), Figured]). The main explanation of this fact is that put options 
have in general less variance than call options and in Figure [2] a maturity of T = 3 was 
used which leads to a larger variance than with T = 1. 

• Influence of q : once again, we can observe that increasing the sparsity of the polynomial 
basis (ie, decreasing q) can lead to a biased convergence. When q = 0.6, we get a biased 
result (see curves (*) and (a)), even for M large (curve (a), M = 30000). 

Then, it is advisable for American put options to use almost full polynomial basis with 
fewer Monte-Carlo simulations in order to master the computational cost rather than doing 
the contrary. 



Hedging American put basket options. Now, let us present the convergence of the 
approximation of the delta at time 0. Table [T] presents the values of the delta of an American 
put basket option when the iterations increase. We see that the convergence is very fast (we 
only need 3 iterations to get a stabilized value). The parameters of the algorithm are the 
following ones: n = 1000, M = 5000, q = 1, rj = 3 and oj = 1. 
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Iteration 


A 1 


A^ 


A^ 


A 4 


A 5 


1 


-0.203931 


-0.205921 


-0.203091 


-0.205264 


-0.201944 


2 


-0.105780 


-0.102066 


-0.103164 


-0.102849 


-0.108371 


3 


-0.109047 


-0.105929 


-0.105604 


-0.105520 


-0.111327 


4 


-0.108905 


-0.105687 


-0.105841 


-0.105774 


-0.111137 


5 


-0.108961 


-0.105648 


-0.105725 


-0.105647 


-0.111274 



Table 1: Convergence of the delta for an American put basket option with d = 5, p = 0.2, T — 1, 
S = 100, a = 0.25, 8 = 0.1, K = 100, r = 0.05. 



6.3 Dupire's framework 

We consider a (i-dimensional local volatility model in which the dynamics under the risk- 
neutral measure of each asset is supposed to be given by 

dS\ = S l t ((r - 6i)dt + a(t, SftdW?) S = (5 \ . . . , S ") 

where W = (W , . . . , W d ) is defined and generated as in the Black-Scholes framework. The 
local volatility function a we have chosen is of the form 

a(t, x) = 0.6(1.2 - e -o.it e -o.ooi(^-«)» )e -o.OBV* j {%A) 

with s > 0. Since there exists a duality between the variables (t, x) and (T, K) in Dupire's 
framework, one should choose s equal to the spot price of the underlying asset. Then, the 
bottom of the smile is located at the forward money. The parameters of the algorithm in this 
paragraph are the following : n = 1000, M = 30000, N = 10, q = 1, 77 = 3. 



Pricing and Hedging European put basket options. We consider the put basket option 
with payoff g iven by (16.20. The benchma rk price and delta are computed using the algorithm 
proposed by Jourdain and Lelong (2009), which is based on Monte-Carlo methods. 

Concerning the delta, we get at the last iteration the following vector A = (—0.062403 — 
0.061271 - 0.062437 - 0.069120 - 0.064743). The benchmark delta is -0.0625. 



Pricing and Hedging American put basket options. We still consider the put payoff 
given by (|6.2|) . In the case of American options in local volatility models, there is no bench- 
mark. However, Figure [6] shows that the algorithm converges after few iterations. We get 
a price around 6.30. At the last iteration, we get A = (-0.102159 - 0.102893 - 0.103237 - 
0.110546 - 0.106442). 



6.4 Speed up. 

Remark 3. Because in high dimensions, the sequential algorithm can run several hours before 
giving a price, we could not afford to run the sequential algorithm on the cluster to have a 
benchmark value for the reference CPU time used in the speed up measurements. Instead we 
have computed speed ups as the ratio 

CPU time for 8 processors / 8 

speed up = — — 6.5 

CPU time for n processors x n 

This explains why we may get in the tables below some speed ups slightly greater than 1. 
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app pnce 
ref price 



Figure 5: Convergence of the algorithm for a European put basket option with d = 5, p = 0, T = 1, 
So = 100, S = 0, K = 100, r = 0.05, r\ = 3, co = 0. The benchmark price computed with a high 
precision Monte-Carlo method yields 1.745899 with a confidence interval of (1.737899, 1.753899). 




Figure 6: Convergence of the algorithm for an American put basket option with d = 5, p = 0, T = 1, 
So = 100, (5 = 0.1, K = 100, r = 0.05 and u = 1. 



Our goal in this paper was to design a scalable algorithm for high dimensional problems, 
so it is not surprising that the algorithm does not behave so well in relatively small dimension 
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as highlighted in Table In dimension 3, the speed ups are linear up to 28 processors but 
then they dramatically decrease toward zero: this can be explained by the small CPU load 
of the computation of the correction term at a given point (tj, x,). The cost of each iteration 
of the loop line 2] of Algorithm [T] is proportional to Mpd 3 and when d is small so is p - 
the number of polynomials of total degree less or equal than rj. For instance, for d = 3 and 
r] = 3, we have p = 20, which gives a small complexity for each iteration over i. Hence, when 
the number of processors used increases, the amount of work to be done by its processor 
between two synchronisation points decreases to such a point that most of the CPU time is 
used for transmitting data or waiting. This explains why the speed ups decrease so much. 
Actually, we were expecting such results as the parallel implementation of the algorithm has 
been designed for high dimensional problems in which the amount of work to be done by 
each processor cannot decrease so much unless several dozens of thousands of processors are 
used. This phenomena can be observed in Table which shows very impressive speed ups: 
in dimension 6, even with 256 processors the speed ups are still linear which highlights the 
scalability of our implementation. Even though computational times may look a little higher 
than with other algorithms, one should keep in mind that our algorithm not only computes 
prices but also hedges, therefore the efficiency of the algorithm remains quite impressive. 



Nb proc. 


Time 


Speed up 


8 


543.677 


1 


16 


262.047 


1.03737 


18 


233.082 


1.03669 


20 


210.114 


1.03501 


24 


177.235 


1.02252 


28 


158.311 


0.981206 


32 


140.858 


0.964936 


64 


97.0629 


0.70016 


128 


103.513 


0.328267 


256 


162.936 


0.104274 



Table 2: Speed ups for the American put option with d = 3, r = 0.02, T = 1, a = 0.2, p = 0, 
= 100, K = 95, M = 1000, N = 10, K = 10, n = 2000, r = 3, q = 1, u = 1. See Equation (g]>J) for 
the definition of the "Speed up" column. 



7 Conclusion 

In this work, we have presented a parallel algorithm for solving BSDE and applied it to the 
pricing and hedging of American option which remains a computationally demanding problem 
for which very few scalable implementations have been studied. Our parallel algorithm shows 
an impressive scalability in high dimensions. To improve the efficiency of the algorithm, we 
could try to refactor the extrapolation step to make it more accurate and less sensitive to the 
curse of dimensionality. 
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Nb proc. 


Time 


Speed up 


8 


1196.79 


1 


16 


562.888 


1.06308 


24 


367.007 


1.08698 


32 


272.403 


1.09836 


40 


217.451 


1.10075 


48 


181.263 


1.10042 


56 


154.785 


1.10457 


64 


135.979 


1.10016 


72 


121.602 


1.09354 


80 


109.217 


1.09579 


88 


99.6925 


1.09135 


96 


91.9594 


1.08453 


102 


85.6052 


1.0965 


110 


80.2032 


1.08523 


116 


75.9477 


1.08676 


128 


68.6815 


1.08908 


256 


35.9239 


1.04108 



Table 3: Speed ups for the American put option with d = 6, r = 0.02, T = 1, a = 0.2, p = 0, 
So = 100, K = 95, M = 5000, N = W,K= 10, n = 2000, r = 3, q = 1, u = 1. See Equation §1$ for 
the definition of the "Speed up" column. 
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